Abstract. This paper introduces the first Geiger-mode ladar processing algorithm
suitable for implementation on an FPGA, enabling real-time processing and data
downlink. Current airborne Geiger-mode ladar systems obtain high resolution and wide
area coverage, but require sensors and processing counterparts with large size,
weight, and power (SWaP) requirements. We are developing an airborne “Micro-ladar”
system that collects up to 1 billion samples/sec, weighs on the order of 250 grams,
consumes no more than 20 W, and fits in a package not much larger than a coffee cup
(320 cc). To reduce SWaP, we developed embedded FPGA real-time processing algorithms
that take noisy raw data, streaming at upwards of 1 GB/sec, and filter it to obtain a
nearly noise-free 3D point cloud with high compression rates, resulting in a 2 MB/sec
output data rate that can be readily downlinked to the ground over typical UAV
communication links. A physical 64x256 Geiger-mode ladar array was integrated with an
FPGA processing board running a baseline processing algorithm, where the processing
board has 8 orders of magnitude lower SWaP (m³·kg·W) than typical airborne ladar
processing systems. Quantitative results using simulated ladar data input indicate
that the new FPGA algorithm produces data with quality comparable to previous
state-of-the-art 3D ladar processing algorithms, while suffering a reduction in area
coverage rates.


1  Introduction 

3D ladar imaging sensors are critical enablers for autonomous navigation systems
in robotic and unmanned aerial vehicle (UAV) applications, as well as for airborne
wide-area 3D terrain mapping and foliage penetration (FoPen) missions in support of
national security (Albota M. H., 2002). In this paper, we develop the first
Geiger-mode ladar processing algorithm implemented on an FPGA. By utilizing FPGAs we
are able to achieve a “micro-ladar” system that resides in a design space not
previously explored. It has sampling rates on the same order of magnitude as the much
larger airborne ladar system, ALIRT, but operates at shorter ranges measured in
hundreds of meters instead of kilometers (Knowlton, 2011).

The rest of the paper is organized as follows. Section 2 reviews related work
that is essential to the implementation process. Section 3 discusses the different
FPGA algorithm implementations. Section 4 presents our implementation and results,
including FPGA complexity studies and algorithm performance results. Detailed FPGA
utilization reports are generated for current operating parameters and extrapolated
for future system upgrades. Simulated ladar data is processed using the FPGA and
compared to the Matlab output (Michael E. O’Brien, 2005). Section 5 concludes with a
discussion of the lessons learned and directions for future work.


2  Related  Work 

Three-dimensional Laser Radar (3-D Lidar) sensors output range images, which
provide explicit 3-D information about a scene (Heinrichs, 2001). Most 3D imaging
for robotics relies on synchronous time-of-flight (TOF) focal plane arrays (FPAs),
one example being the Microsoft Kinect sensor. Such sensors typically work up to 10 m
in range, under limited (indoor) lighting conditions, and take on the order of 100k
to 1 million measurement samples per second. Pulsed TOF FPAs, such as Velodyne Lidar,
are used for precision surveys and autonomous vehicle navigation, with ranges upwards
of 1 km and 100k to 1 million measurement samples per second. Next up are pulsed,
flying-spot, single-pixel linear-mode sensors, with increased range to target but
measurement rates still limited to fewer than 1 million samples per second. Such
systems are used for mapping applications; however, their area coverage rates border
on the low side for collecting high-resolution imagery over large regions. Figure 1
provides some background on where prior 3D imaging systems map in terms of
measurement sampling rates versus range to target.

Figure 1. Background on where prior 3D imaging systems map in terms of measurement
sampling rates versus range to target.

Pulsed Geiger-mode APDs, especially combined with scanning hardware, offer a
significant step up in both measurement rates (100 million samples per second) and
range to target (10 km+), thanks to ever larger multi-pixel arrays as well as
single-photon-sensitive detection (Albota M. A., 2002) (Aull, 2002). An example of
such a system is ALIRT (~10-20 million samples/sec, 7 km range) (Knowlton, 2011).

The proposed “micro-ladar” system resides in a location not previously explored in
this parameter space. It has sampling rates on the same order of magnitude as ALIRT,
but at shorter ranges to target, hundreds of meters instead of kilometers (Marino,
2003). Achieving this measurement rate while requiring a reduction of many orders of
magnitude in size, weight, and power (SWaP) necessitates a radical rethink of the
algorithms used to process the data.


3  FPGA  Implementation  of  Noise  Filtering  Algorithm 

In  this  section  we  introduce  and  implement  novel  Geiger-mode  ladar  de-noising 
algorithms  (PixelCP)  developed  for  Field  Programmable  Gate  Arrays  (FPGAs). 


3.1  Algorithm  Design 

A new set of 3D ladar noise filtering algorithms must be developed in order to
achieve such a drastic reduction in SWaP while maintaining performance. The
algorithms were first implemented in Matlab, where multiple modules of varying
complexity were developed, leading to a large number of possible module combinations
(384 variants). Figure 2 captures a high-level description of each processing step as
well as the implemented modules. This paper explores the implementation of a modified
“baseline” algorithm.

Figure 2. PixelCP Variants.

For the first step, transforming the data, we have two options: motion
compensation enabled or disabled. For the no-motion-compensation option, we assume
that the sensor movement is negligible and that each APD pixel captures the same
scene location. The data is transformed from raw range data directly to a 3D
angle-angle-range (AAR) space that is defined in angle-angle by the APD row-col
information. In essence, we accept the blur caused by any jitter or unexpected
movement during the integration time.

For the motion compensation case, we developed a method that is computationally
much less complex than the transform codebase (Vasile, 2012) used on prior 3D ladar
sensors. Such transforms require a lot of double-precision floating-point math, which
is not well suited to FPGA implementation. Thus, we developed an approximate
transform method, where we define a 3D angle-angle-range space in the APD
row-col-range space of the first frame for a given integration period. The rest of
the APD data frames within an integration interval are motion compensated relative to
the first-frame AAR space, by first computing the relative rotation-translation
compared to the first frame, and then applying that projection matrix to just the
center line of sight (LOS) at the average range for the current frame, in order to
obtain a delta row-col-range translation.

This motion compensation transformation from raw stream to a 3D AAR space is exact
for the center line of sight, but only approximate for the remaining pixels under
certain conditions. Given no sensor translation and just roll-pitch rotational
changes (due to mirror scanning, platform attitude), the 3D transformation is
equivalent to a full per-pixel transform. With sensor translation and/or yaw rotation
(rotation around the center LOS), the motion compensation transform is only an
approximation of the full per-pixel transform. For translation-only motion, the
approximation errors arise from variation in range position and are negligible,
considering that the range to target (300 m) and range gate (~100 m) are much larger
than the platform movement: at 15 m/sec over an integration time of around 100 ms,
the platform moves about 1.5 m. In essence, 3D Cartesian translation errors can be
well approximated as 3D AAR space translations through the use of small-angle
approximations.

However, the motion compensation approximation is very sensitive to yaw rotations
(around the center LOS), with the array size amplifying the effect: even for small
sub-pixel angles (<100 µrad), the lever arm effect for an APD corner pixel compared
to the center pixel can lead to significant blur. For example, a 50 µrad (1/2 pixel
error) yaw rotation would lead to a blur on the corner of a 256x256 APD array of
$50\,\mu\mathrm{rad} \times (128^2 + 128^2)^{0.5} / 100\,\mu\mathrm{rad} \approx 90$
pixels. Considering that most attitude changes for a Puma-like fixed-wing UAV occur
in roll-pitch rather than yaw, and that mirror scanning can typically be expressed as
purely roll-pitch changes, the sensitivity to yaw changes is not concerning. If we
decide to use a different platform, such as a quad-copter, the motion compensation
might need to be improved. For now, we will implement this approximate motion
compensation technique, as the computation is simple and the approximation holds
well for our target platform.
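
The center-LOS shift described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the flight code: the function names, the axis convention (z along the center LOS, x along columns, y along rows), the sign convention of the shift, and the 0.25 m range bin are all hypothetical choices made for the example; the 100 µrad pixel angle is taken from the text.

```python
import math

def center_los_delta(rotation, translation, avg_range_m,
                     ifov_rad=100e-6, range_bin_m=0.25):
    """Shift (d_row, d_col, d_range_bin) of one frame relative to the first frame.

    rotation:    3x3 nested list mapping current-frame axes to first-frame axes.
    translation: sensor displacement in meters, expressed in the first frame.
    The axis convention and bin sizes are assumptions for this sketch.
    """
    # Only one point is transformed: the center LOS at the average range.
    p = (0.0, 0.0, avg_range_m)
    q = [sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
         for i in range(3)]
    # Convert that point to angle-angle-range in the first frame.
    rng = math.sqrt(q[0] ** 2 + q[1] ** 2 + q[2] ** 2)
    az = math.atan2(q[0], q[2])   # column direction
    el = math.atan2(q[1], q[2])   # row direction
    return (el / ifov_rad, az / ifov_rad, (rng - avg_range_m) / range_bin_m)

def compensate(points, delta):
    """Apply the same shift to every (row, col, range_bin) sample of the frame."""
    d_row, d_col, d_rng = delta
    return [(r + d_row, c + d_col, z + d_rng) for r, c, z in points]
```

Because a single shift is applied to the whole frame, the per-frame cost is one small matrix-vector product plus additions, which is the property that makes the approach attractive for fixed-point hardware.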

Now that we have data converted to a 3D AAR space, we attempt to automatically
distinguish signal from noise using the spatial coincidence of points. The concept is
simple: the more 3D points returned at the same spatial location, the more likely it
is that the points came from a real scene surface as opposed to “hits” due to
background light or dark noise. To detect spatial coincidence, we bin data into 3D
voxels of various volumes in terms of pixel-pixel-range units. Smaller bin sizes
allow higher-frequency data to be captured, at the expense of a reduced ability to
separate signal from noise and higher memory storage requirements. Larger bins
recover signal better in the presence of noise and lead to lower memory storage, but
reduce 3D data fidelity. The best of both worlds (high fidelity and good SNR) can be
achieved by convolution.
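
As a minimal sketch of the binning step, the following Python function counts detections per voxel. The function name and the example bin shape are assumptions for illustration; the paper's Matlab and VHDL implementations are not shown here.

```python
from collections import Counter

def bin_points(points, bin_shape=(8, 8, 1)):
    """Count detections per voxel.

    points:    iterable of integer (row, col, range_bin) samples in AAR space.
    bin_shape: voxel size in (pixel, pixel, range) units; the default is only
               an example value.
    """
    rows_per_bin, cols_per_bin, ranges_per_bin = bin_shape
    counts = Counter()
    for row, col, rng in points:
        voxel = (row // rows_per_bin, col // cols_per_bin, rng // ranges_per_bin)
        counts[voxel] += 1
    return counts

# Three returns from the same surface patch and two isolated noise hits.
samples = [(0, 0, 10), (7, 7, 10), (3, 4, 10), (8, 0, 10), (1, 1, 55)]
voxels = bin_points(samples)
```

In this example the voxel `(0, 0, 10)` collects three hits while the noise hits each land alone in their own voxel, which is the coincidence cue the text describes.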

For small bin sizes, where we have high 3D fidelity, the SNR is typically not
enough to discriminate signal from noise. Convolution is used to recover spatial
coincidence while still preserving 3D data fidelity: a 3D matched kernel is convolved
with the 3D binned array. This matched kernel typically takes on the shape of the
system point spread function (PSF), which captures sensor uncertainties in the
angle-angle and range directions. The kernel can be a full 3D kernel, but might also
consist of just a 2D angle-angle kernel, or potentially just a 1D range kernel,
depending on the ratio of angular error to angular bin size and the ratio of range
error to range bin size.
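
The simplest of these options, a 1D range kernel, can be sketched as below. The kernel weights `[1, 2, 1]` are a placeholder and not a measured PSF, and the zero-padded edge handling is an assumption of this example.

```python
def convolve_range(hist, kernel):
    """Same-length 1D convolution of a range histogram with an odd-length kernel.

    Bins outside the histogram are treated as zero (an assumption of this sketch).
    """
    half = len(kernel) // 2
    out = [0.0] * len(hist)
    for i in range(len(hist)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + half - k          # convolution index (kernel is flipped)
            if 0 <= j < len(hist):
                acc += weight * hist[j]
        out[i] = acc
    return out

# A surface smeared over bins 1..3 and a single noise hit in bin 5.
hist = [0, 1, 2, 1, 0, 1, 0, 0]
smoothed = convolve_range(hist, [1, 2, 1])
```

After filtering, the surface bin scores 6 while the isolated hit scores 2, so the margin between signal and noise grows from 2:1 to 3:1 without enlarging the bins.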

Once coincident information is available, we can detect valid signal as 3D
peaks. The peak detection algorithm has four variants to choose from.


The first is the single-peak, fixed-threshold option. In this method, a
predetermined noise threshold is set for the minimum bin value. The values for each
unique angle-angle bin are considered, to extract a 1D histogram in the range
direction. For each histogram, the index that contains the maximum bin value is
recorded as the single-peak range position, with the bin value directly determining
the 3D point reflectivity. The result is a 2½D single-peak range image. The second,
the adaptive-threshold single-peak method, is similar to the first approach, but
adaptively determines a threshold based on overall signal integration, bin size, and
current data strength. The third algorithm, a fixed-threshold multi-peak method,
outputs both the first peak and the last peak above a certain threshold, leading to a
2-peak solution. The most advanced algorithm uses a non-maxima suppression technique
to identify peak maxima and ignore peak-begin and peak-end tails, and outputs all
such peak locations above an adaptive threshold.
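
The two fixed-threshold variants can be sketched on a single range histogram as follows. The function names, the inclusive threshold test, and the local-maximum rule used to define a "peak" are assumptions of this example rather than details given in the text.

```python
def single_peak(hist, threshold):
    """Fixed-threshold single peak: (index, value) of the largest bin, or None."""
    if not hist:
        return None
    best = max(range(len(hist)), key=lambda i: hist[i])  # first max wins ties
    return (best, hist[best]) if hist[best] >= threshold else None

def first_last_peak(hist, threshold):
    """Fixed-threshold multi-peak: indices of the first and last peaks, or None.

    A peak is taken here to be a bin at or above the threshold that exceeds
    its left neighbor and is not below its right neighbor (assumed rule).
    """
    last = len(hist) - 1
    peaks = [i for i, v in enumerate(hist)
             if v >= threshold
             and (i == 0 or v > hist[i - 1])
             and (i == last or v >= hist[i + 1])]
    return (peaks[0], peaks[-1]) if peaks else None

# Canopy return near bin 2 and a stronger ground return near bin 7.
hist = [0, 1, 5, 2, 0, 0, 3, 7, 3, 0]
```

With a threshold of 4, `single_peak(hist, 4)` reports only the strongest return, whereas `first_last_peak(hist, 4)` reports both returns, which is what a foliage penetration product needs.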

The  peak  estimation  algorithm  ties  closely  with  the  peak  detection  algorithm.  The 
first  method  considers  the  index  for  the  maximum  bin  value  as  the  resulting  output 
range  position.  This  leads  to  gridding  and  discretization  in  range  results,  with  flat 
surfaces  appearing  jagged.  The  second  method  takes  into  account  neighboring  angle- 
angle-range  bins  to  obtain  a  floating  point  estimate  in  both  angle-angle  and  range, 
removing  some  of  the  gridding  artifacts. 
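
One common way to obtain such a floating-point estimate is a hit-weighted centroid around the peak bin. The sketch below applies it in the range direction only; the three-bin window, the 0.25 m bin size, and the function name are assumptions, and the method described above also uses the angle-angle neighbors.

```python
def refine_range(hist, peak, bin_size_m=0.25):
    """Sub-bin range estimate: hit-weighted centroid of the peak bin and its
    immediate range neighbors (window size is an assumption of this sketch)."""
    lo = max(peak - 1, 0)
    hi = min(peak + 1, len(hist) - 1)
    total = sum(hist[lo:hi + 1])
    if total == 0:
        return peak * bin_size_m
    centroid = sum(i * hist[i] for i in range(lo, hi + 1)) / total
    return centroid * bin_size_m
```

For a symmetric peak the estimate stays at the bin center; for a skewed peak it moves toward the heavier neighbor, which reduces the staircase appearance of flat surfaces.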

Lastly, the reflectivity algorithm has three methods to choose from. The first
method assigns the reflectivity based directly on the bin value. The second method
takes into account integration time to obtain what is better known as “probability
of detection”. The third method builds on the second, but also takes into account
APD and photon blocking effects for surfaces under foliated conditions.

In  order  to  evaluate  algorithm  performance  and  capture  image  quality  as  a  function 
of  different  data  collection  parameters,  we  need  quantitative  image  quality  metrics. 
For  this  analysis  we  observe  angular  resolution,  range  precision,  impulse  detection, 
edge  sharpness,  FOPEN  ground  coverage,  and  SNR. 


3.2  Single  Peak  Detection 

We first implemented a very basic PixelCP algorithm directly in VHDL. For this
we selected module variants that could be easily parallelized and implemented on an
FPGA. We chose not to incorporate motion compensation in our implementation because
its complexity and benefits depend on the devices used for measuring motion. For
binning data we used uint-to-uint operations with eight-by-eight “macro-pixels”,
meaning that each group of 64 pixels produces one output image pixel. In the next
step we chose to forgo convolution, as it is the most complex operation. For the peak
detection, we used the fixed-threshold, single-peak, and max-bin index methods.
Lastly, we implemented the normalized peak value algorithm for the reflectivity
estimation.

The first step was to find a way to highly parallelize this combination of
PixelCP modules. We found that each macro-pixel’s operations are completely
independent of the others, which makes the macro-pixel the best place to start
parallelizing the algorithm. Within a macro-pixel, the 64 pixels are read
sequentially, and the range value of each pixel is used as the address into the bin
memory. The value at that address is incremented by one and written back into the bin
memory at the same address. When the value is incremented, it is also compared to the
last known highest value. If the new value is equal to or greater than the previous
value, it is stored in an output memory block, “Max Hits”. We then check whether the
value has changed, and if so, the bin address that corresponds to that value is also
stored in another output memory block, “Max Range”. A block diagram of the operations
performed on one macro-pixel is shown in Figure 3.
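
The per-macro-pixel loop can be modeled in software as shown below. This is a behavioral sketch in Python, not the VHDL firmware; the function and variable names are hypothetical, and the tie handling follows a literal reading of the description (the range is updated only when the stored maximum actually changes).

```python
def process_macro_pixel(range_values, num_bins):
    """Behavioral model of one macro-pixel.

    range_values: range-bin index reported by each pixel, in readout order.
    Returns (max_hits, max_range, bins).
    """
    bins = [0] * num_bins        # bin memory, addressed by range
    max_hits = 0                 # models the "Max Hits" output block
    max_range = None             # models the "Max Range" output block
    for addr in range_values:
        new_value = bins[addr] + 1      # read and increment
        bins[addr] = new_value          # write back to the same address
        if new_value >= max_hits:
            changed = new_value != max_hits
            max_hits = new_value
            if changed:                 # update range only on a real change
                max_range = addr
    return max_hits, max_range, bins

hits, peak_range, bin_memory = process_macro_pixel([5, 9, 5, 9, 5, 2], 16)
```

The running maximum means no second pass over the bin memory is needed to find the first peak: the result is available as soon as the last pixel has been read.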


Over the entire array, 256 individual macro-pixels are operated on in parallel,
so that each output is produced as quickly as possible. In addition to parallelizing
the macro-pixels, we implemented the firmware to operate as data is being read in
from the array. For example, if the array is read out from four quadrants at the same
time, each quadrant is operated on as its data flows in, in real time. This allows us
to pipeline the processing with the data input so that there is no delay between data
acquisition and processed output.

In order to simplify timing and synchronization between the FPGA and the APD,
the FPGA clock is operated at 125 MHz, the same rate as the APD data readout. We can
then use our knowledge of FPGA components to calculate an expected completion time.
Reading all 64 pixels into memory takes 2 clocks per pixel, for a total of 128
clocks. Incrementing the value and storing it back in memory takes another 2 clocks
per pixel. Altogether this makes 256 clocks per output. If all the macro-pixels are
operated on in parallel, then the output should be ready in 2.048 microseconds. This
is far shorter than the 50-microsecond frame period of the 20 kHz system.
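
The arithmetic behind this estimate can be checked with a few lines of Python. The function and parameter names are illustrative; the numbers are those quoted in the text.

```python
def macro_pixel_latency_us(pixels=64, read_clocks=2, update_clocks=2,
                           clock_hz=125e6):
    """Clock count and latency for one macro-pixel, using the figures in the text."""
    clocks = pixels * (read_clocks + update_clocks)
    return clocks, clocks / clock_hz * 1e6

clocks, latency_us = macro_pixel_latency_us()
frame_period_us = 1e6 / 20e3          # 20 kHz frame rate
margin = frame_period_us / latency_us
print(clocks, round(latency_us, 3), round(frame_period_us, 1), round(margin, 1))
```

Under these assumptions the estimated latency is roughly one twenty-fourth of the frame period, which leaves room for the partial serialization discussed in Section 4.1.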


3.3  Multi-Peak  Detection 

Next, we implemented a second-peak (multi-peak) version in order to achieve
foliage penetration. We came up with a method that identifies any number of
additional peaks without the need to reprocess the input data. The first peak is
calculated exactly as in the single-peak version, and then the bin memory is
reprocessed to produce a second peak.


The max range output of the first peak is used as a starting address in the bin
memory. From that point, a predetermined range offset is subtracted from the address
so as to avoid any nearby false peaks. The value at this new memory address (range)
is compared with the last known highest value. If the new value is equal to or
greater than the previous value, it is stored in a second output memory block, “Max
Hits”. If the value has changed, the bin address that corresponds to that value is
also stored in another output memory block, “Max Range”. A block diagram of the
operations performed on one macro-pixel can be seen below in Figure 4.

Figure 4. Multi-peak block diagram (8x8 pixels = 1 macro-pixel).

This method preserves the parallelized macro-pixels that were created for the
first peak. We can then repeat this algorithm to allow any number of additional peaks
to be detected. However, each additional peak requires that the previous peak be
fully processed, and therefore adds time sequentially to the total process.
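
A behavioral sketch of the second-peak search is given below. It is not the VHDL implementation: the names are hypothetical, and the scan direction (from the offset address down to bin 0) is an assumption inferred from the offset being subtracted from the first-peak address.

```python
def second_peak(bins, first_peak_addr, guard_offset):
    """Search the already-filled bin memory for a second peak.

    The scan starts guard_offset bins below the first peak and walks toward
    address 0 (assumed direction). Returns (max_hits, max_range).
    """
    max_hits, max_range = 0, None
    start = first_peak_addr - guard_offset
    for addr in range(start, -1, -1):
        value = bins[addr]
        if value >= max_hits:
            changed = value != max_hits
            max_hits = value
            if changed:              # update range only on a real change
                max_range = addr
    return max_hits, max_range

# First peak at bin 7; a weaker return sits at bin 2.
bin_memory = [0, 1, 4, 1, 0, 0, 2, 9, 3, 0]
result = second_peak(bin_memory, first_peak_addr=7, guard_offset=2)
```

Because the search reuses the bin memory built for the first peak, no raw pixel data is read again; the guard offset keeps the shoulder of the first peak (bin 6 in this example) from being reported as the second peak.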

As in the single-peak case, the FPGA clock is operated at 125 MHz, the same rate
as the APD data readout, and we can use our knowledge of FPGA components to calculate
an expected completion time. For the second peak, not all 64 pixels must be read from
memory. The number can fluctuate depending on the location of the first peak. If we
assume that the first peak is in the upper half of the range extent, then we can also
assume that approximately 50 percent of the pixels will be read out. In this case
there are only a total of 128 clocks per output. If all the macro-pixels are operated
on in parallel, then the output should be ready in 1.024 microseconds. This is far
shorter than the 50-microsecond frame period of the 20 kHz system.


4  Results 

The following sections report on FPGA utilization and algorithm timing, and
compare the outputs with known good outputs from Matlab. Reports were generated using
Xilinx and Questa Modelsim development tools. All results were obtained on Kintex 7
FPGAs with a clock speed of 125 MHz and a 64 by 256 array operating at a frame rate
of 20 kHz.


4.1  Timing  Analysis 

We closely analyzed Modelsim simulations to determine the performance of both
the single- and multi-peak algorithms. For the single-peak algorithm we found that
the processing could be conducted as the raw data was read in. This removed the wait
for incoming data to finish before processing. Additionally, we reduced the number of
macro-pixels operating in parallel, as most were idle while waiting for data to come
in. This reduced FPGA complexity while maintaining timing; however, it did serialize
some of the processing steps. Figure 5 is a timing diagram of the algorithm
implementation on an FPGA. The timing markers in this figure are approximate time
identifiers, not exact timing values.

Figure 5. Multi-peak Timing.


In Figure 5, the signal “framecount” identifies the current frame that is being
processed and “pixcnt” identifies the computation of the first peak. The timing
marker on the far left (231886394 picoseconds) identifies the start of data being
read into the FPGA. The second marker (268797322 picoseconds) marks the end of the
37 microseconds that it takes to read the data in and process the first peak output.
While this is significantly longer than what we had calculated in Section 3.2, it
still meets system requirements.

For the multi-peak implementation we chose to build on the single-peak
implementation. Keeping the partially serialized macro-pixels did slow the second
peak output, but it still falls well within the timing requirements of the system. In
Figure 5 the second peak processing is identified by the signal “binaddress”. The 8.2
microseconds between the second marker (268797322 picoseconds) and the third marker
(277045162 picoseconds) show the generation of the second peak output directly after
the first peak processing finishes. After both the first and second peaks are
calculated there are still another 8 microseconds of idle time before the next frame
is ready to process. Although we did not implement the functionality, there is almost
enough time to perform a third peak calculation and still meet system requirements.
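
The two durations quoted above follow directly from the marker values, as the short check below shows. The variable and function names are illustrative; the marker values are those read from Figure 5, which the text describes as approximate.

```python
# Timing markers from Figure 5, in picoseconds.
MARKERS_PS = (231_886_394, 268_797_322, 277_045_162)

def intervals_us(markers_ps):
    """Durations between consecutive markers, converted to microseconds."""
    return [(b - a) / 1e6 for a, b in zip(markers_ps, markers_ps[1:])]

first_peak_us, second_peak_us = intervals_us(MARKERS_PS)
print(round(first_peak_us, 1), round(second_peak_us, 1))
```

The first interval (readout plus first peak) comes to about 36.9 microseconds and the second (second peak) to about 8.2 microseconds, matching the rounded values in the text.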


4.2  Algorithm  Performance 

We defined system requirements in terms of the 3D spatial fidelity of processing
results. Prior studies of human interpretability of foliated scenes performed for
FALCON-I as well as ALIRT indicate that an x-y post spacing of 25 cm or finer is
needed to accurately distinguish man-made structures from vegetation clutter. Thus,
we require that 3D snapshot products retain features on the order of 25 cm x-y
spatial resolution. Current technology enables a range resolution around 25 cm, which
sets the desired range resolution requirement. Thus, we require a spatial resolution
of 25 cm in all three dimensions.


Table 1. PixelCP vs. MPSCP Performance

Next, we tested the single- and multi-peak designs on a simulated data set. A
small snapshot of the simulated raw input data can be seen below in Figure 6.


Figure  6.  Raw  Simulation  Data. 


The data set was run through the original Matlab code as well as a Modelsim
simulation, and the output images were compared using Matlab's point cloud tool. The
outputs of both can be seen below in Figure 7.


Figure  7.  Matlab  First  Peak  Value  (Left)  vs.  FPGA  Single/Multi  Peak  Value  (Right). 

Upon inspecting the data sets we found small differences in the output images.
Further investigation showed that the two algorithms resolve the ambiguity of similar
hit values in different range bins differently. In one such situation the Matlab
algorithm would select the bin at a range of 50 m, whereas the FPGA algorithm would
select the 20 m range bin. We neglect this difference, as the two algorithms still
return extremely similar results.

With verification that both the single- and multi-peak implementations met timing
and correctly generated output products on simulated data, the next step was to test
the algorithm on a live data stream. For this experiment the system was set up with a
64 by 256 GmAPD operating at a 20 kHz frame rate. Using a PicoQuant laser source, a
single pinhole was illuminated so that approximately 10 percent of the photodiodes
were firing at any given time. This data was read out and processed by the FPGA in
real time. The raw input and the output products of both the first and second peaks
can be seen below in Figure 8.


Figure  8.  Live  FPGA  performance  from  left  to  right:  raw  data  input,  first  peak  range,  first  peak 
hit  count,  second  peak  range,  and  second  peak  hit  count. 

The image labeled “Max Bin” is the range output of the first peak; it clearly
shows a significant difference in range directly around the center of the image (near
the pinhole). The next image to the right displays the hit counts associated with
that range; a clear image of the laser profile can be seen. The fourth and fifth
images show the max range and hit count of the second peak. However, since only one
laser pulse is present during each frame, the data in these outputs is an artifact of
random noise on the array.


4.3  FPGA  Utilization 

We also generated FPGA utilization reports using Vivado to determine which FPGA
could accommodate the algorithm's demands. We elected to skip the single-peak-only
version, given the algorithm and timing performance of the multi-peak version.
Vivado's report gives a detailed breakdown of FPGA usage in terms of bits used in
each of the main FPGA blocks (Flip Flops, LUT, LUTRAM, BRAM, IO, GT, BUFG, CMT, and
PCIe interfaces). As shown in Figure 9, the report was generated for a 256 by 64
array and compared with a Kintex 7 FPGA.


Figure 9. Multi-Peak PixelCP on K7-410.

From this report we can see that the multi-peak implementation utilizes
significantly less of the FPGA in every category compared to the convolution method.
The PCIe utilization is maxed out at 100 percent because the specific FPGA we
selected has only one port available. For the system we designed, this number would
not grow with more complicated algorithms or larger arrays and can therefore be
ignored.

It is apparent from this report that moving to a larger array format, such as a
256 by 256 array, would still be plausible on this FPGA. If all resources were
assumed to double with the array size, the FPGA would still have resources to spare.
However, further investigation of the timing performance would be needed.

The SWaP of the processing component of the overall system can now be confidently
reduced while still maintaining system performance. As seen in Table 2, there is a
$3 \times 10^{7}$-fold reduction from the current state of the art (MPSCP) processing
SWaP to the new “micro-ladar” implementation.


Table 2. PixelCP vs. MPSCP SWaP

Algorithm    Size        Weight     Power
PixelCP      20 in³      <1 lb      ~6 W
MPSCP        4107 in³    270 lbs    4000 W
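
The reduction factor can be reproduced from the Table 2 entries by taking the ratio of the size-weight-power products. In the sketch below the "<1 lb" and "~6 W" entries are taken as 1 lb and 6 W, which is an assumption; the dictionary and function names are illustrative.

```python
# Table 2 entries; "<1 lb" is taken as 1 lb and "~6 W" as 6 W (assumptions).
PIXELCP = {"size_in3": 20.0, "weight_lb": 1.0, "power_w": 6.0}
MPSCP = {"size_in3": 4107.0, "weight_lb": 270.0, "power_w": 4000.0}

def swap_product(system):
    """Size x weight x power figure of merit for one system."""
    return system["size_in3"] * system["weight_lb"] * system["power_w"]

ratio = swap_product(MPSCP) / swap_product(PIXELCP)
print(f"{ratio:.2e}")
```

With these inputs the ratio comes to roughly 3.7 x 10^7. Since the ratio is dimensionless, it does not depend on whether the product is expressed in in³·lb·W or m³·kg·W, and a PixelCP weight below 1 lb would only increase it.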

5  Conclusion  and  Future  Work 

In this paper we developed a coincidence processing algorithm to create
high-resolution FOPEN imagery on low-SWaP Geiger-mode ladar systems using FPGAs. We
implemented a multi-peak and convolution version of PixelCP on an FPGA. We examined
the performance of these algorithms compared to their Matlab counterparts.
Additionally, we determined their ability to operate with real-time performance and
presented the feasibility of implementing these algorithms on existing FPGA
technology. Finally, we developed an algorithm that met all our system requirements.

Future work involves increasing the clock speed of the FPGA in order to reduce
multi-peak timing. Additionally, a larger array can be implemented to cover larger
areas or increase resolution. Lastly, a VHDL implementation of the convolution
algorithm with simulated timing needs to be developed to further assess its viability
on an FPGA.